Current Issue : October - December Volume : 2013 Issue Number : 4 Articles : 4 Articles
We describe a heuristic scheduling approach for optimizing floating-point pipelines subject to input port constraints. The objective of our technique is to maximize functional unit reuse while minimizing the following performance metrics in the generated circuit: (1) maximum multiplexer fanin, (2) datapath fanout, (3) number of multiplexers, and (4) number of registers. For a set of systems biology markup language (SBML) benchmark expressions, we compare the resource usages given by our method to those given by a branch-and-bound enumeration of all valid schedules. Compared with the enumeration results, our heuristic requires on average 33.4% fewer multiplexer bits and 32.9% fewer register bits than the worst case, while only requiring 14% more multiplexer bits and 4.5% more register bits than the optimal case. We also compare our results against those given by the state-of-the-art high-level synthesis tool Xilinx AutoESL. For the most complex of our benchmark expressions, our synthesis technique requires 20% fewer FPGA slices than AutoESL....
Decimal floating-point operations are important for applications that cannot tolerate errors from conversions between binary and decimal formats, for instance, commercial, financial, and insurance applications. In this paper we present five different radix-10 digit recurrence dividers for FPGA architectures. The first one implements a simple restoring shift-and-subtract algorithm, whereas each of the other four implementations performs a nonrestoring digit recurrence algorithm with signed-digit redundant quotient calculation and carry-save representation of the residuals. More precisely, the quotient digit selection function of the second divider is implemented fully by means of a ROM, the quotient digit selection functions of the third and fourth dividers are based on carry-propagate adders, and the fifth divider decomposes each digit into three components and requires neither a ROM nor a multiplexer. Furthermore, the fixed-point divider is extended to support IEEE 754-2008 compliant decimal floating-point division for the decimal64 data format. Finally, the algorithms have been synthesized on a Xilinx Virtex-5 FPGA, and implementation results are given....
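As a rough illustration of the restoring shift-and-subtract idea mentioned in this abstract, the following is a minimal software sketch, not the authors' hardware design. It assumes a fixed-point setting with a non-negative integer dividend smaller than the divisor, so the quotient is a decimal fraction below 1. The function name `restoring_divide_radix10` and its parameters are hypothetical.

```python
def restoring_divide_radix10(dividend, divisor, n_digits):
    """Radix-10 restoring division sketch (hypothetical helper).

    Assumes 0 <= dividend < divisor, so the quotient is 0.d1 d2 ... dn.
    Returns the list of quotient digits and the final residual.
    """
    if divisor <= 0 or not (0 <= dividend < divisor):
        raise ValueError("requires 0 <= dividend < divisor")

    residual = dividend
    digits = []
    for _ in range(n_digits):
        residual *= 10              # shift the residual left by one decimal digit
        q = 0
        while True:
            residual -= divisor     # trial subtraction
            if residual < 0:
                residual += divisor  # went negative: restore and stop
                break
            q += 1                  # subtraction succeeded: count it
        digits.append(q)            # q is in 0..9 because residual < 10 * divisor
    return digits, residual


if __name__ == "__main__":
    quotient_digits, remainder = restoring_divide_radix10(1, 7, 6)
    print(quotient_digits, remainder)
```

Each iteration preserves the invariant `dividend * 10**k == Q * divisor + residual`, where `Q` is the integer formed by the first `k` quotient digits. The nonrestoring variants in the abstract avoid the restore step by allowing signed quotient digits, which this sketch does not model.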
The nonlinear vector precoding (VP) technique has been proven to achieve close-to-capacity performance in multiuser multiple-input multiple-output (MIMO) downlink channels. The performance benefit with respect to its linear counterparts stems from the incorporation of a perturbation signal that reduces the power of the precoded signal. The computation of this perturbation element, which is known to belong to the class of NP-hard problems, is the main aspect that hinders the hardware implementation of VP systems. In this respect, several tree-search algorithms have hitherto been proposed for the closest-point lattice search problem in VP systems. Nevertheless, the optimality of these algorithms has been assessed mainly in terms of error-rate performance and computational complexity, leaving the hardware cost of their implementation an open issue. The parallel data-processing capabilities of field-programmable gate arrays (FPGAs) and the loopless nature of the proposed tree-search algorithms have enabled an efficient hardware implementation of a VP system that provides a very high data-processing throughput....
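To make the role of the perturbation signal concrete, here is a small sketch under simplifying assumptions: a real-valued model, a given precoding matrix `P`, a modulo constant `tau`, and an exhaustive search over a tiny box of integer vectors. This brute-force search is a stand-in for illustration only; it is not one of the tree-search algorithms the abstract refers to, and all names and numeric values are hypothetical.

```python
from itertools import product


def precoded_power(P, s, l, tau):
    """Power of the precoded vector x = P (s + tau * l) in a real-valued model."""
    v = [s[i] + tau * l[i] for i in range(len(s))]
    x = [sum(P[r][c] * v[c] for c in range(len(v))) for r in range(len(P))]
    return sum(xi * xi for xi in x)


def brute_force_perturbation(P, s, tau, radius=1):
    """Try every integer vector l with entries in [-radius, radius]
    and return the one that minimizes the precoded power.
    The cost grows as (2 * radius + 1) ** len(s), which is why
    practical systems rely on smarter lattice searches."""
    candidates = product(range(-radius, radius + 1), repeat=len(s))
    return min(candidates, key=lambda l: precoded_power(P, s, l, tau))


if __name__ == "__main__":
    P = [[2.0, -1.5], [-1.5, 2.0]]   # assumed precoding matrix
    s = [1.0, 1.0]                   # assumed user symbols
    tau = 4.0                        # assumed modulo constant
    best = brute_force_perturbation(P, s, tau)
    print(best, precoded_power(P, s, best, tau))
```

Because the zero vector is always among the candidates, the selected perturbation never yields more power than unperturbed precoding in this model.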
The ability to map instructions running in a microprocessor to a reconfigurable processing unit (RPU), acting as a coprocessor, enables the runtime acceleration of applications and ensures code and possibly performance portability. In this work, we focus on the mapping of loop-based instruction traces (called Megablocks) to RPUs. The proposed approach considers offline partitioning and mapping stages without ignoring their future runtime applicability. We present a toolchain that automatically extracts specific trace-based loops, called Megablocks, from MicroBlaze instruction traces and generates an RPU for executing those loops. Our hardware infrastructure is able to move loop execution from the microprocessor to the RPU transparently, at runtime, and without changing the executable binaries. The toolchain and the system are fully operational. Three FPGA implementations of the system, differing in the hardware interfaces used, were tested and evaluated with a set of 15 application kernels. Speedups ranging from 1.26x to 3.69x were achieved for the best alternative using a MicroBlaze processor with local memory....